Policy Gradient
Policy Based Reinforcement Learning
Previously, we approximated the value function or action-value function and derived the policy from it. This time, we will approximate the policy directly, still in the model-free reinforcement learning setting.
Value Based RL:
- Learn the value function and derive the policy from it.
- The policy is implicit in the value function (e.g. $\epsilon$-greedy with respect to the learned action values).
Policy Based RL:
- No value function.
- Learn the policy directly.
Why might we prefer policy-based methods over value-based methods?
Value-based methods store the value of each state or state-action pair, whereas policy-based methods store the policy directly, which is more effective in high-dimensional or continuous action spaces and can represent stochastic policies.
However, policy-based methods typically converge to a local rather than a global optimum, and evaluating a policy is typically inefficient and high-variance.
Policy Objective Functions
The objective is: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$. This requires a measure of the quality of $\pi_\theta$:
- In episodic environments, the start value can be used: $J_1(\theta) = V^{\pi_\theta}(s_1)$
- In continuing environments, the average value can be used: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$, or the average reward per time step $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) \mathcal{R}_s^a$
where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.
Since the aim is to find the $\theta$ that maximizes $J(\theta)$, policy-based reinforcement learning is an optimization problem.
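As a concrete illustration of the start-value objective, $J_1(\theta)$ can be estimated by averaging discounted returns over Monte Carlo rollouts from $s_1$. The sketch below assumes hypothetical `env_reset`, `env_step`, and `policy` callables (not from the original notes):

```python
def estimate_start_value(env_reset, env_step, policy, gamma=0.99, n_episodes=100):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1).

    Assumed interfaces (hypothetical, for illustration):
      env_reset() -> initial state s_1
      env_step(s, a) -> (next_state, reward, done)
      policy(s) -> action sampled from pi_theta(s, .)
    """
    total = 0.0
    for _ in range(n_episodes):
        s, discount, ret = env_reset(), 1.0, 0.0
        done = False
        while not done:
            a = policy(s)
            s, r, done = env_step(s, a)
            ret += discount * r      # accumulate discounted reward
            discount *= gamma
        total += ret
    return total / n_episodes        # sample mean of the returns
```

Running this with two different parameter vectors $\theta$ gives a (noisy) way to compare policies, which is exactly what the objective function is for.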
Policy Gradient
Let $J(\theta)$ be any policy objective function. Policy gradient algorithms search for a local maximum of $J(\theta)$ by ascending the gradient of the policy: $\Delta\theta = \alpha \nabla_\theta J(\theta)$,
where $\nabla_\theta J(\theta)$ is the policy gradient (the vector of partial derivatives $\partial J(\theta) / \partial \theta_k$) and $\alpha$ is the step-size parameter.
Finite difference methods can be used to estimate the gradient. Basically, we can estimate the $k$-th partial derivative by perturbing the parameter vector by a small amount $\epsilon$ in the $k$-th dimension: $\partial J(\theta) / \partial \theta_k \approx (J(\theta + \epsilon u_k) - J(\theta)) / \epsilon$, where $u_k$ is the unit vector for dimension $k$. This is simple and works even for non-differentiable policies, but it is noisy and needs $n$ evaluations for an $n$-dimensional $\theta$.
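The finite-difference estimate can be sketched as follows; `J` stands for any (possibly noisy) evaluator of the policy objective, e.g. the Monte Carlo estimator above:

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate grad_theta J(theta) one coordinate at a time.

    For each dimension k, perturb theta by eps along the unit
    vector u_k and use (J(theta + eps*u_k) - J(theta)) / eps.
    """
    grad = np.zeros_like(theta, dtype=float)
    base = J(theta)                       # J at the current parameters
    for k in range(theta.size):
        perturbed = theta.copy()
        perturbed[k] += eps               # theta + eps * u_k
        grad[k] = (J(perturbed) - base) / eps
    return grad
```

Note the cost: one extra evaluation of $J$ per parameter dimension, which is why this approach does not scale to large policies.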
Actor-Critic Policy Gradient
Monte Carlo methods can be used to estimate the gradient, but they have high variance.
The actor-critic method uses a critic to estimate the value function and an actor to learn the policy.
- Critic: updates the action-value function parameters $w$ to minimize the estimation error.
- Actor: updates the policy parameters $\theta$ in the direction suggested by the critic.
The critic solves the policy evaluation problem: it estimates the value function of the current policy.
The actor updates the policy parameters, which can be viewed as the policy improvement step.
Approximating the action-value function introduces bias, but the bias can be avoided by using a compatible function approximator, i.e. one whose features are the score of the policy, $\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)$; if $w$ additionally minimizes the mean-squared error, the policy gradient computed with $Q_w$ is exact.
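A minimal sketch of one actor-critic update, assuming a small discrete problem, a linear critic with one-hot state-action features, and a per-state softmax actor (all of these modeling choices are illustrative, not from the original notes):

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2
ALPHA_W, ALPHA_THETA, GAMMA = 0.1, 0.01, 0.9

def phi(s, a):
    """One-hot feature vector for the linear critic Q_w(s,a) = w . phi(s,a)."""
    f = np.zeros(N_STATES * N_ACTIONS)
    f[s * N_ACTIONS + a] = 1.0
    return f

def softmax_policy(theta, s):
    """pi_theta(s, .) as a softmax over per-(state, action) preferences."""
    prefs = theta[s]
    p = np.exp(prefs - prefs.max())       # subtract max for numerical stability
    return p / p.sum()

def qac_step(w, theta, s, a, r, s2, a2):
    """One actor-critic update on transition (s, a, r, s', a')."""
    # Critic: TD(0) update of w towards the target r + gamma * Q_w(s', a')
    td_error = r + GAMMA * (w @ phi(s2, a2)) - w @ phi(s, a)
    w += ALPHA_W * td_error * phi(s, a)
    # Actor: move theta along grad log pi(s, a), scaled by the critic's Q_w(s, a)
    p = softmax_policy(theta, s)
    grad_log = -p
    grad_log[a] += 1.0                    # softmax score: 1{a} - pi(s, .)
    theta[s] += ALPHA_THETA * (w @ phi(s, a)) * grad_log
    return w, theta
```

The critic's TD update is the policy evaluation step; the actor's score-weighted update is the policy improvement step, exactly as described above.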
Eligibility Traces
Eligibility traces are a key concept in reinforcement learning that bridge the gap between Monte Carlo methods and Temporal Difference (TD) learning, combining the strengths of both approaches for more efficient learning. When used with policy gradients, they can improve the learning process in policy-based reinforcement learning algorithms.
Eligibility Traces with Policy Gradient
When combining eligibility traces with policy gradient methods, the objective is to adjust the policy parameters in a way that takes the temporal structure of the problem into account. The key idea is to use eligibility traces to assign credit to actions based on their contribution to future rewards, which is particularly useful in environments with delayed rewards.
Implementation:
- Eligibility traces update: at each time step, update the eligibility traces for the actions taken. This typically involves decaying the traces over time and adding new traces for the current actions.
- Policy parameter update: use the eligibility traces to adjust the policy parameters. The adjustment is based on the gradient of the policy with respect to the parameters, scaled by the eligibility traces.
#MMI706 - Reinforcement Learning at METU